BREAST CANCER WISCONSIN DATA SET.

Katyna Sada

2/17/2021

1. Introduction

Cancer is the second leading cause of death globally. In 2018 there were approximately 9.6 million cancer deaths worldwide. These numbers are rising: by 2040 the number of new cases per year is expected to reach up to 29.5 million. Regardless of the type of cancer present, cancer cells are characterized by growing out of control and becoming invasive (WHO). They take up the space and nutrients that healthy organs require, which can then trigger the malfunctioning of certain body systems (NIH).

Breast cancer is a type of cancer that starts in the breast and occurs almost entirely in women; 2.1 million new cases were estimated in 2018 (WHO). A total of 44,130 deaths from breast cancer are estimated to occur in 2021 (ASCO).

There is a need for new techniques to correctly diagnose and treat breast cancer.

The aim of this project is to develop an algorithm that correctly predicts whether a sample of breast cancer cell nuclei is benign or malignant. This is done by comparing several machine learning algorithms in terms of accuracy and execution time.

2. Dataset

This dataset is hosted on Kaggle (Breast Cancer Wisconsin (Diagnostic) Data Set) and originally comes from the UCI Machine Learning Repository.

The dataset contains 30 features plus the ID number and diagnosis of the participants. The features were computed by measuring 10 parameters on each cell nucleus from a digitized image of a fine needle aspirate (FNA) of a breast mass. For each parameter, the mean, standard error and "worst" (mean of the three largest values) were obtained.

A breast fine needle aspiration (FNA) removes some fluid or cells from a breast lesion (a cyst, lump, sore or swelling) with a fine needle similar to that used for blood tests. The sample of fluid or cells (or tissue) is then examined (Borecky).

Real-valued features:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

The main libraries required for the development of this project are shown in the chunk below.
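The original chunk is not reproduced in this rendering; a plausible loading chunk for the analysis that follows (the exact package list is an assumption) would be:

```r
library(tidyverse)   # data manipulation (dplyr, tidyr) and plotting (ggplot2)
library(skimr)       # descriptive data summaries
library(recipes)     # pre-processing pipelines
library(caret)       # model training, resampling and evaluation
library(corrplot)    # correlation heatmaps
```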

The dataset can be visualized in the table shown below. The sample ID and an empty column were removed, as they have no relevance to the experiment.
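A minimal loading sketch, assuming the Kaggle CSV is named `data.csv` and the empty trailing column is exported as `X` (both names are assumptions):

```r
# Read the Kaggle export (file name is an assumption)
data <- read.csv("data.csv")

# Drop the sample ID and the empty trailing column; neither is predictive
data <- subset(data, select = -c(id, X))

# Encode the target as a factor (levels B = benign, M = malignant)
data$diagnosis <- as.factor(data$diagnosis)
```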

3. Exploratory data analysis

Variable type

First, a descriptive exploration of the variables was performed in order to understand them better and detect possible problems.
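The summary below can be produced with a single call, assuming the skimr package and a data frame named `data`:

```r
# Descriptive summary: counts, missingness, quantiles and mini-histograms
skimr::skim(data)
```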

Data summary
Name data
Number of rows 569
Number of columns 31
_______________________
Column type frequency:
factor 1
numeric 30
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
diagnosis 0 1 FALSE 2 B: 357, M: 212

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
radius_mean 0 1 14.13 3.52 6.98 11.70 13.37 15.78 28.11 ▂▇▃▁▁
texture_mean 0 1 19.29 4.30 9.71 16.17 18.84 21.80 39.28 ▃▇▃▁▁
perimeter_mean 0 1 91.97 24.30 43.79 75.17 86.24 104.10 188.50 ▃▇▃▁▁
area_mean 0 1 654.89 351.91 143.50 420.30 551.10 782.70 2501.00 ▇▃▂▁▁
smoothness_mean 0 1 0.10 0.01 0.05 0.09 0.10 0.11 0.16 ▁▇▇▁▁
compactness_mean 0 1 0.10 0.05 0.02 0.06 0.09 0.13 0.35 ▇▇▂▁▁
concavity_mean 0 1 0.09 0.08 0.00 0.03 0.06 0.13 0.43 ▇▃▂▁▁
concave.points_mean 0 1 0.05 0.04 0.00 0.02 0.03 0.07 0.20 ▇▃▂▁▁
symmetry_mean 0 1 0.18 0.03 0.11 0.16 0.18 0.20 0.30 ▁▇▅▁▁
fractal_dimension_mean 0 1 0.06 0.01 0.05 0.06 0.06 0.07 0.10 ▆▇▂▁▁
radius_se 0 1 0.41 0.28 0.11 0.23 0.32 0.48 2.87 ▇▁▁▁▁
texture_se 0 1 1.22 0.55 0.36 0.83 1.11 1.47 4.88 ▇▅▁▁▁
perimeter_se 0 1 2.87 2.02 0.76 1.61 2.29 3.36 21.98 ▇▁▁▁▁
area_se 0 1 40.34 45.49 6.80 17.85 24.53 45.19 542.20 ▇▁▁▁▁
smoothness_se 0 1 0.01 0.00 0.00 0.01 0.01 0.01 0.03 ▇▃▁▁▁
compactness_se 0 1 0.03 0.02 0.00 0.01 0.02 0.03 0.14 ▇▃▁▁▁
concavity_se 0 1 0.03 0.03 0.00 0.02 0.03 0.04 0.40 ▇▁▁▁▁
concave.points_se 0 1 0.01 0.01 0.00 0.01 0.01 0.01 0.05 ▇▇▁▁▁
symmetry_se 0 1 0.02 0.01 0.01 0.02 0.02 0.02 0.08 ▇▃▁▁▁
fractal_dimension_se 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.03 ▇▁▁▁▁
radius_worst 0 1 16.27 4.83 7.93 13.01 14.97 18.79 36.04 ▆▇▃▁▁
texture_worst 0 1 25.68 6.15 12.02 21.08 25.41 29.72 49.54 ▃▇▆▁▁
perimeter_worst 0 1 107.26 33.60 50.41 84.11 97.66 125.40 251.20 ▇▇▃▁▁
area_worst 0 1 880.58 569.36 185.20 515.30 686.50 1084.00 4254.00 ▇▂▁▁▁
smoothness_worst 0 1 0.13 0.02 0.07 0.12 0.13 0.15 0.22 ▂▇▇▂▁
compactness_worst 0 1 0.25 0.16 0.03 0.15 0.21 0.34 1.06 ▇▅▁▁▁
concavity_worst 0 1 0.27 0.21 0.00 0.11 0.23 0.38 1.25 ▇▅▂▁▁
concave.points_worst 0 1 0.11 0.07 0.00 0.06 0.10 0.16 0.29 ▅▇▅▃▁
symmetry_worst 0 1 0.29 0.06 0.16 0.25 0.28 0.32 0.66 ▅▇▁▁▁
fractal_dimension_worst 0 1 0.08 0.02 0.06 0.07 0.08 0.09 0.21 ▇▃▁▁▁

After visually analyzing the distributions of all the features, no abnormalities were found. Other than the target variable, which is a factor, all variables are numeric, and the variable type is correctly assigned to every feature.
In addition, there are no missing values.

Dataset features

Target variable

The target variable indicates whether the cancer is benign (B) or malignant (M). As seen before, there are 357 benign samples and 212 malignant samples.

Next, boxplots of all the features were created in order to visualize their importance for classification. These boxplots are grouped into mean, standard error and “worst”.
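A sketch of how the grouped boxplots might be drawn with ggplot2, assuming the tidyverse is loaded and the data frame is named `data` (object names are assumptions):

```r
# Boxplots of the "_mean" features by diagnosis; the "_se" and "_worst"
# groups are plotted the same way with the matching suffix
data %>%
  select(diagnosis, ends_with("_mean")) %>%
  pivot_longer(-diagnosis, names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = diagnosis, y = value, fill = diagnosis)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_y")
```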

Mean boxplots

Except for the fractal_dimension_mean boxplot, there seems to be a significant difference in the values of all features when comparing benign and malignant samples. Many features also show outliers. In addition, the range of values varies greatly between features: for example, the mean area of a sample can reach a value of 2500, while the highest fractal dimension of a sample does not reach 0.1.

Correlation

Given how the features were derived, correlation between them is expected. As seen in the correlation heatmap, the highest correlations are found between the perimeter, area and radius features.
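A sketch of the heatmap computation, assuming the corrplot package and a data frame named `data`:

```r
# Correlation matrix of the 30 numeric predictors
cor_matrix <- cor(data[, names(data) != "diagnosis"])

# Heatmap; hierarchical clustering groups correlated features together
corrplot::corrplot(cor_matrix, method = "color", order = "hclust", tl.cex = 0.6)
```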

4. Division of data in training and testing & data preprocessing

Stratified sampling

As seen before, there are more benign than malignant samples. Stratified sampling was used in order to guarantee that each split preserves the class proportions of the complete dataset.

## 
##         B         M 
## 0.6274165 0.3725835
## 
##         B         M 
## 0.6283186 0.3716814
## 
##        B        M 
## 0.627193 0.372807
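Proportions like those above can be obtained with a stratified split along these lines; the seed is an assumption, and the 456 training rows reported later correspond to roughly an 80/20 split:

```r
set.seed(123)  # seed value is an assumption

# Stratified index: sampling is done within each diagnosis class
train_idx <- caret::createDataPartition(data$diagnosis, p = 0.8, list = FALSE)
train_data <- data[train_idx, ]
test_data  <- data[-train_idx, ]

# Class proportions in the full set, training set and test set
prop.table(table(data$diagnosis))
prop.table(table(train_data$diagnosis))
prop.table(table(test_data$diagnosis))
```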

5. Feature Engineering

Some transformations were performed on the raw data with the aim of improving the performance of the algorithms.

A recipe object was created in order to pre-process the data.

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         30

Imputation of missing values

There are no missing values, so there is no need to impute.

Variables with variance close to zero

As seen in the table below, all predictors have a significant variance. Hence, no features were removed based on this criterion.

Standardization and scaling

The boxplots demonstrated the need for centering and scaling: the difference in scales can have a great impact on the model. The step_normalize function centers and scales the data.

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         30
## 
## Operations:
## 
## Centering and scaling for all_numeric()

Once the recipe object has been created with the pre-processing transformations, they are learned on the training data and applied to both sets.

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         30
## 
## Training data contained 456 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for radius_mean, texture_mean, ... [trained]
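The whole pipeline can be sketched as follows (object names are assumptions):

```r
library(recipes)

# Recipe: diagnosis is the outcome, the 30 numeric columns are predictors,
# and every numeric variable is centered and scaled
cancer_recipe <- recipe(diagnosis ~ ., data = train_data) %>%
  step_normalize(all_numeric())

# Learn the means and standard deviations on the training data only...
trained_recipe <- prep(cancer_recipe, training = train_data)

# ...then apply the same transformation to both splits
train_prep <- bake(trained_recipe, new_data = train_data)
test_prep  <- bake(trained_recipe, new_data = test_data)
```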

A “glimpse” of the values of the features after applying the transformations is shown below.

## Rows: 456
## Columns: 31
## $ radius_mean             <dbl> 1.1026662, 1.8353345, 1.5854321, -0.7630822, …
## $ texture_mean            <dbl> -2.0368159, -0.3414951, 0.4568427, 0.2572583,…
## $ perimeter_mean          <dbl> 1.27735149, 1.69374265, 1.57418479, -0.586926…
## $ area_mean               <dbl> 0.9838048, 1.9037595, 1.5555920, -0.7567497, …
## $ smoothness_mean         <dbl> 1.59709014, -0.82451799, 0.96398998, 3.330921…
## $ compactness_mean        <dbl> 3.33607335, -0.48595341, 1.07505329, 3.457096…
## $ concavity_mean          <dbl> 2.69220042, -0.01326369, 1.38895856, 1.947309…
## $ concave.points_mean     <dbl> 2.58556012, 0.57146330, 2.08288679, 1.4885803…
## $ symmetry_mean           <dbl> 2.30706285, 0.02186195, 0.98940170, 2.9771876…
## $ fractal_dimension_mean  <dbl> 2.31733476, -0.88939910, -0.40635208, 5.04247…
## $ radius_se               <dbl> 2.37985886, 0.47285546, 1.17168663, 0.3072245…
## $ texture_se              <dbl> -0.52600515, -0.83843605, -0.74182673, -0.069…
## $ perimeter_se            <dbl> 2.72216499, 0.25563868, 0.81964691, 0.2779709…
## $ area_se                 <dbl> 2.325575265, 0.686914706, 1.099058909, -0.280…
## $ smoothness_se           <dbl> -0.19032973, -0.58560573, -0.27416595, 0.7224…
## $ compactness_se          <dbl> 1.356680925, -0.683695689, 0.847154173, 2.805…
## $ concavity_se            <dbl> 0.71209089, -0.41400693, 0.21812085, 0.804409…
## $ concave.points_se       <dbl> 0.6825459, 0.2843963, 1.4417704, 1.1338896, -…
## $ symmetry_se             <dbl> 1.172419370, -0.779156596, 0.261925750, 4.751…
## $ fractal_dimension_se    <dbl> 0.913959900, -0.083344862, 0.306057486, 2.043…
## $ radius_worst            <dbl> 1.8696583, 1.7894970, 1.4976275, -0.2823654, …
## $ texture_worst           <dbl> -1.34090113, -0.35703897, -0.01398177, 0.1429…
## $ perimeter_worst         <dbl> 2.30085932, 1.53384537, 1.34655126, -0.247826…
## $ area_worst              <dbl> 1.96064959, 1.85181528, 1.42511569, -0.546512…
## $ smoothness_worst        <dbl> 1.338856915, -0.379130952, 0.542497956, 3.468…
## $ compactness_worst       <dbl> 2.69410378, -0.43561800, 1.11878871, 4.005450…
## $ concavity_worst         <dbl> 2.15333090, -0.14259294, 0.87673281, 2.031285…
## $ concave.points_worst    <dbl> 2.33628782, 1.11076543, 1.99054851, 2.2143529…
## $ symmetry_worst          <dbl> 2.8082563, -0.2375773, 1.1824953, 6.1601542, …
## $ fractal_dimension_worst <dbl> 2.02437558, 0.29713613, 0.21389567, 5.1516732…
## $ diagnosis               <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, …

6. Predictive models

The train control for all models was created using the cross-validation resampling method with 10 folds.
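A minimal sketch of this control object with caret (the name `ctrl` is an assumption):

```r
# 10-fold cross-validation, reused by every model below
ctrl <- caret::trainControl(method = "cv", number = 10)
```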

Model 1: Glmnet

Glmnet fits a generalized linear model via penalized maximum likelihood. The penalization parameters are alpha and lambda; hyperparameter selection was left at its default. These hyperparameters help deal with correlated predictors, which is of importance in this project.

##   alpha      lambda
## 5  0.55 0.007608795

The most important variable is radius_worst.
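A sketch of how this model might be trained and evaluated (the object names `train_prep`, `test_prep` and `ctrl` are assumptions):

```r
set.seed(123)  # seed value is an assumption

# caret tunes alpha and lambda over its default grid via 10-fold CV
glmnet_fit <- caret::train(diagnosis ~ ., data = train_prep,
                           method = "glmnet", trControl = ctrl)
glmnet_fit$bestTune        # selected alpha and lambda
caret::varImp(glmnet_fit)  # variable importance

# Hold-out evaluation on the test split
pred <- predict(glmnet_fit, newdata = test_prep)
caret::confusionMatrix(pred, test_prep$diagnosis)
```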

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  0
##          M  0 42
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9679, 1)
##     No Information Rate : 0.6283     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6283     
##          Detection Rate : 0.6283     
##    Detection Prevalence : 0.6283     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : B          
## 

Model 2: Random Forest

Random forest is based on constructing a multitude of decision trees and aggregating their predictions.

##   mtry
## 1    1

The most important variable of the model is area_worst.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 69  3
##          M  2 39
##                                           
##                Accuracy : 0.9558          
##                  95% CI : (0.8998, 0.9855)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9048          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9718          
##             Specificity : 0.9286          
##          Pos Pred Value : 0.9583          
##          Neg Pred Value : 0.9512          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6106          
##    Detection Prevalence : 0.6372          
##       Balanced Accuracy : 0.9502          
##                                           
##        'Positive' Class : B               
## 

Model 3: AdaBoost

The next algorithm used was AdaBoost. Hyperparameter selection for AdaBoost was left at its default; one of two methods and the number of iterations can be chosen.

##   nIter        method
## 2    50 Real adaboost

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 70  9
##          M  1 33
##                                           
##                Accuracy : 0.9115          
##                  95% CI : (0.8433, 0.9567)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : 6.062e-12       
##                                           
##                   Kappa : 0.8029          
##                                           
##  Mcnemar's Test P-Value : 0.02686         
##                                           
##             Sensitivity : 0.9859          
##             Specificity : 0.7857          
##          Pos Pred Value : 0.8861          
##          Neg Pred Value : 0.9706          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6195          
##    Detection Prevalence : 0.6991          
##       Balanced Accuracy : 0.8858          
##                                           
##        'Positive' Class : B               
## 

Model 4: Support Vector Machines with Linear and Polynomial Kernel

The SVM training algorithm builds a model that assigns new examples to one category or the other. The classification performed can be linear or non-linear. In this case, the hyperparameters are cost, degree and scale; the chosen values are shown in the chunk below. The best model was the linear one.

##   degree scale   C
## 7      1     1 0.1

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  0
##          M  0 42
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9679, 1)
##     No Information Rate : 0.6283     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6283     
##          Detection Rate : 0.6283     
##    Detection Prevalence : 0.6283     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : B          
## 

Model 5: Naive Bayes

The Naive Bayes algorithm applies Bayes’ theorem with strong (naïve) independence assumptions between the features. The hyperparameters were left at their defaults. The results obtained are not as favorable as those of the other models, probably because the algorithm does not consider any correlation between features.

##   fL usekernel adjust
## 2  0      TRUE      1

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 66  2
##          M  5 40
##                                           
##                Accuracy : 0.9381          
##                  95% CI : (0.8765, 0.9747)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : 1.718e-14       
##                                           
##                   Kappa : 0.8693          
##                                           
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.9296          
##             Specificity : 0.9524          
##          Pos Pred Value : 0.9706          
##          Neg Pred Value : 0.8889          
##              Prevalence : 0.6283          
##          Detection Rate : 0.5841          
##    Detection Prevalence : 0.6018          
##       Balanced Accuracy : 0.9410          
##                                           
##        'Positive' Class : B               
## 

Model 6: k-Nearest Neighbors

K-nearest neighbors is an algorithm that classifies new cases based on a similarity measure (distance between neighbors). The hyperparameters were also left at their defaults.

##   kmax distance  kernel
## 2    7        2 optimal

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 70  1
##          M  1 41
##                                           
##                Accuracy : 0.9823          
##                  95% CI : (0.9375, 0.9978)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9621          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9859          
##             Specificity : 0.9762          
##          Pos Pred Value : 0.9859          
##          Neg Pred Value : 0.9762          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6195          
##    Detection Prevalence : 0.6283          
##       Balanced Accuracy : 0.9811          
##                                           
##        'Positive' Class : B               
## 

Comparison of first models

The accuracy and kappa metrics of the 10 folds for each model are shown in the following table.

The time taken to compute all the folds and the final model for each algorithm is shown in the next table.

Some models obtained an accuracy of 1 on one of the folds. For this reason, the mean accuracy over all folds was computed in order to decide which model performed best.
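A sketch of how the per-fold metrics and timings can be collected with caret (the list of fitted model objects is an assumption):

```r
# Gather the 10-fold resampling results of all six models
results <- caret::resamples(list(glmnet = glmnet_fit, rf = rf_fit,
                                 ada = ada_fit, svm = svm_fit,
                                 nb = nb_fit, knn = knn_fit))

summary(results)    # accuracy and kappa per model across folds
results$timings     # training time per model
```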

The models with the highest accuracy are the SVM model and the ADA model. The SVM algorithm has the highest mean accuracy and takes less time to execute than the ADA algorithm, which is actually the one that takes the longest. Their ROC curves are shown below; the AUC of both models confirms their excellent performance.

7. Predictive models with PCA

The same six types of models were created after applying the PCA transformation to the data.

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         30
## 
## Training data contained 456 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for radius_mean, texture_mean, ... [trained]
## PCA extraction with radius_mean, texture_mean, ... [trained]
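The PCA recipe can be sketched by extending the normalization recipe; the variance threshold is an assumption, since the report does not state the number of components retained:

```r
library(recipes)

# Normalize first, then replace the predictors with principal components
# retaining e.g. 95% of the variance (threshold is an assumption)
pca_recipe <- recipe(diagnosis ~ ., data = train_data) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), threshold = 0.95)

trained_pca <- prep(pca_recipe, training = train_data)
train_pca <- bake(trained_pca, new_data = train_data)
```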

Comparison of models with PCA

Metrics…

Time…

As seen in the graph below, the SVM model and the ADA model still outperform all the other models. The first SVM model still has the highest mean accuracy, indicating that reducing dimensionality does not necessarily improve the model, regardless of the number of variables or their correlation. If execution time is taken into account, the GLMpca model is much faster and also performs well.

Finally, the top two algorithms (SVM and ADA) were used again to create two models that take the class imbalance into account. All the features were used.

8. Best predictive models with weights

The weight of each class has to be computed before training the algorithm.
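A sketch of an inverse-frequency weighting scheme; the exact formula used in the report is not shown, so this is an assumption:

```r
# Each sample's weight is inversely proportional to its class frequency,
# so the minority class (M) counts more during training
class_counts <- table(train_data$diagnosis)
sample_weights <- nrow(train_data) /
  (length(class_counts) * class_counts[train_data$diagnosis])

# Passed to the model via caret::train(..., weights = sample_weights)
```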

## # A tibble: 456 x 1
##    radius_mean
##          <dbl>
##  1      -0.161
##  2      -0.292
##  3      -1.31 
##  4      -1.68 
##  5      -0.584
##  6      -0.667
##  7      -0.133
##  8      -0.615
##  9      -0.735
## 10      -0.300
## # … with 446 more rows

Final comparison

Metrics…

Time…

Surprisingly, the SVM model that takes into account the proportion of samples in each class has the same mean accuracy as the plain SVM model, but takes slightly less time to compute.

9. Conclusions

As seen, the algorithm that best predicted the outcome for this dataset was SVM, and the addition of class weights had a positive impact on the model. The ADA model also has outstanding performance but takes a long time to train. The time of execution is of great importance in some applications, so if time is relevant, Glmnet with PCA is also an appropriate choice: reducing the number of features reduces the execution time of the algorithm. The Glmnet model also has high accuracy and high AUC in all folds performed. In the future, other algorithms or other hyperparameter choices could be tested in order to find better results. Other approaches could also be taken to deal with multicollinearity.